In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px

READING THE DATASET

The first step in any machine learning problem is reading the data from a given file format. In this case we have a CSV file from which we will read the data.

In [2]:
Application_data=pd.read_csv("googleplaystore.csv")

CLEANING THE DATASET

The dataset will have redundant values like NaN, some columns will have missing or unrelated values, and some will contain special characters that cannot be fed to our machine learning model. These inconsistencies will be resolved in this section using basic Python and pandas techniques.

In [3]:
Application_data.head()
Out[3]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
In [4]:
Application_data.columns
Out[4]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')
In [5]:
Application_data.describe()
Out[5]:
Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000

Let's move from left to right through the columns of the dataset. We start from the "RATING" column and move to the "PRICE" column, since these are the numeric columns and are necessary features for our model. We will perform the following process on each of these columns:

1)- Check all the unique values in the column.

2)- Replace any unrelated unique values that are not significant.

3)- Perform a null check on each numerical column; if null entries are found, replace them with the mean value.

4)- Remove special characters that prevent aggregations, such as "+" and "," in the Installs column, and "M", "k" and "Varies with device" in the Size column.

5)- Convert columns of object type to their numerical counterparts for analysis and trends.

6)- Perform a final filtration to make sure no remaining inconsistency in any column can affect the performance of our model.
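The six steps above can be sketched as one small helper (a sketch only; the cleaning below is actually done column by column, and the function name `clean_numeric_column` and the sample values are our own, not from the dataset):

```python
import numpy as np
import pandas as pd

def clean_numeric_column(series, bad_values=(), strip_chars=""):
    """Sketch of the six steps for one numeric column: strip special
    characters, drop unrelated values, convert to numeric, and fill
    nulls with the mean."""
    s = series.astype(str)
    if strip_chars:                                   # step 4: special characters
        s = s.str.replace(f"[{strip_chars}]", "", regex=True)
    s = s.replace(list(bad_values), np.nan)           # step 2: unrelated values
    s = pd.to_numeric(s, errors="coerce")             # step 5: numeric conversion
    return s.fillna(s.mean())                         # step 3: null handling

demo = pd.Series(["1,000+", "500+", "Free"])
print(clean_numeric_column(demo, bad_values=["Free"], strip_chars=",+").tolist())
# [1000.0, 500.0, 750.0]
```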

"RATING" column cleaning by following the 6 steps

In [6]:
#Checking whether there are null values in the Ratings column
nullcheck_ratings=pd.isnull(Application_data["Rating"])
Application_data[nullcheck_ratings]
Out[6]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
23 Mcqueen Coloring pages ART_AND_DESIGN NaN 61 7.0M 100,000+ Free 0 Everyone Art & Design;Action & Adventure March 7, 2018 1.0.0 4.1 and up
113 Wrinkles and rejuvenation BEAUTY NaN 182 5.7M 100,000+ Free 0 Everyone 10+ Beauty September 20, 2017 8.0 3.0 and up
123 Manicure - nail design BEAUTY NaN 119 3.7M 50,000+ Free 0 Everyone Beauty July 23, 2018 1.3 4.1 and up
126 Skin Care and Natural Beauty BEAUTY NaN 654 7.4M 100,000+ Free 0 Teen Beauty July 17, 2018 1.15 4.1 and up
129 Secrets of beauty, youth and health BEAUTY NaN 77 2.9M 10,000+ Free 0 Mature 17+ Beauty August 8, 2017 2.0 2.3 and up
130 Recipes and tips for losing weight BEAUTY NaN 35 3.1M 10,000+ Free 0 Everyone 10+ Beauty December 11, 2017 2.0 3.0 and up
134 Lady adviser (beauty, health) BEAUTY NaN 30 9.9M 10,000+ Free 0 Mature 17+ Beauty January 24, 2018 3.0 3.0 and up
163 Anonymous caller detection BOOKS_AND_REFERENCE NaN 161 2.7M 10,000+ Free 0 Everyone Books & Reference July 13, 2018 1.0 2.3 and up
180 SH-02J Owner's Manual (Android 8.0) BOOKS_AND_REFERENCE NaN 2 7.2M 50,000+ Free 0 Everyone Books & Reference June 15, 2018 3.0 6.0 and up
185 URBANO V 02 instruction manual BOOKS_AND_REFERENCE NaN 114 7.3M 100,000+ Free 0 Everyone Books & Reference August 7, 2015 1.1 5.1 and up
227 Y! Mobile menu BUSINESS NaN 9 1.2M 100,000+ Free 0 Everyone Business April 9, 2018 1.0.5 6.0 and up
321 【Ranobbe complete free】 Novelba - Free app tha... COMICS NaN 1330 22M 50,000+ Free 0 Everyone Comics July 3, 2018 6.1.1 4.2 and up
478 Truth or Dare Pro DATING NaN 0 20M 50+ Paid $1.49 Teen Dating September 1, 2017 1.0 4.0 and up
479 Private Dating, Hide App- Blue for PrivacyHider DATING NaN 0 18k 100+ Paid $2.99 Everyone Dating July 25, 2017 1.0.1 4.0 and up
480 Ad Blocker for SayHi DATING NaN 4 1.2M 100+ Paid $3.99 Teen Dating August 2, 2018 1.2 4.0.3 and up
610 Random Video Chat DATING NaN 3 16M 1,000+ Free 0 Mature 17+ Dating July 15, 2018 4.20 4.0.3 and up
613 Random Video Chat App With Strangers DATING NaN 3 4.8M 1,000+ Free 0 Mature 17+ Dating July 17, 2018 1. 4.0 and up
617 Meet With Strangers: Video Chat & Dating DATING NaN 2 3.7M 500+ Free 0 Mature 17+ Dating July 16, 2018 1. 4.0 and up
620 Ost. Zombies Cast - New Music and Lyrics DATING NaN 1 4.6M 100+ Free 0 Teen Dating July 20, 2018 1.0 4.0.3 and up
621 Dating White Girls DATING NaN 0 3.6M 50+ Free 0 Mature 17+ Dating July 20, 2018 1.0 4.0 and up
623 Geeks Dating DATING NaN 0 13M 50+ Free 0 Mature 17+ Dating July 10, 2018 1.0 4.1 and up
624 Live chat - free video chat DATING NaN 1 8.7M 500+ Free 0 Mature 17+ Dating July 23, 2018 3.52 4.0.3 and up
626 Fishing Brain & Boating Maps Marine DATING NaN 3 6.9M 500+ Free 0 Everyone Dating July 23, 2018 1.0 4.0.3 and up
627 CAM5678 Video Chat DATING NaN 0 39M 500+ Free 0 Mature 17+ Dating July 13, 2018 5.5.8 4.0.3 and up
628 Video chat live advices DATING NaN 0 8.0M 100+ Free 0 Everyone Dating July 10, 2018 1.0 3.0 and up
629 chat live chat DATING NaN 24 3.9M 1,000+ Free 0 Mature 17+ Dating July 26, 2018 1.0 4.0 and up
630 Pet Lovers Dating DATING NaN 0 14M 10+ Free 0 Mature 17+ Dating July 9, 2018 1.0 4.1 and up
631 Friend Find: free chat + flirt dating app DATING NaN 23 11M 100+ Free 0 Mature 17+ Dating July 31, 2018 1.0 4.4 and up
632 Latin Dating DATING NaN 0 13M 10+ Free 0 Mature 17+ Dating July 9, 2018 1.0 4.1 and up
635 Wifi Mingle DATING NaN 0 10.0M 10+ Free 0 Everyone Dating July 27, 2018 1.3 4.4 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10746 FP Opgaver TOOLS NaN 9 61M 1,000+ Free 0 Everyone Tools May 31, 2018 1.9 4.0.3 and up
10748 FP Live COMMUNICATION NaN 0 3.3M 10+ Free 0 Teen Communication November 3, 2017 1.2.4 4.2 and up
10751 FP Market FAMILY NaN 24 44k 1,000+ Free 0 Everyone Education June 17, 2012 1.0 1.6 and up
10759 FP NFC Rewrite TOOLS NaN 17 67k 1,000+ Free 0 Everyone Tools January 29, 2016 1.1 4.0 and up
10761 Greek Bible FP (Audio) BOOKS_AND_REFERENCE NaN 5 8.0M 1,000+ Free 0 Everyone Books & Reference August 29, 2016 1.0.0 4.0.3 and up
10762 The FP Shield NEWS_AND_MAGAZINES NaN 0 11M 10+ Free 0 Everyone News & Magazines February 1, 2018 1.1.0 4.4 and up
10764 FP Transportation AUTO_AND_VEHICLES NaN 1 885k 1+ Free 0 Everyone Auto & Vehicles March 9, 2018 10.0.0 4.0 and up
10769 FQ Magazine LIFESTYLE NaN 1 12M 100+ Free 0 Everyone Lifestyle December 12, 2016 1.0 4.1 and up
10772 FQ Load Board for Transporters BUSINESS NaN 0 3.9M 100+ Free 0 Everyone Business February 16, 2018 1.1.3 5.0 and up
10773 FQ India LIFESTYLE NaN 0 8.9M 10+ Free 0 Everyone Lifestyle July 31, 2018 7.2.2 4.1 and up
10774 Miss FQ NEWS_AND_MAGAZINES NaN 0 36M 10+ Free 0 Everyone News & Magazines April 5, 2018 3.8 4.4 and up
10775 FQ - Football Quiz SPORTS NaN 1 9.0M 1+ Free 0 Everyone Sports May 29, 2018 1.0 5.0 and up
10788 Fountain Live Wallpaper HD – Dubai Wallpaper 3D PERSONALIZATION NaN 1 20M 500+ Free 0 Everyone Personalization April 13, 2018 1.0 4.1 and up
10794 PopStar FAMILY NaN 13 5.7M 1,000+ Free 0 Everyone Casual January 3, 2018 1.6 4.3 and up
10798 Word Search Tab 1 FR FAMILY NaN 0 1020k 50+ Paid $1.04 Everyone Puzzle February 6, 2012 1.1 3.0 and up
10806 SnakeBite911 FR MEDICAL NaN 1 42M 500+ Free 0 Everyone Medical October 9, 2017 1.2 4.1 and up
10807 My FR App TOOLS NaN 2 4.2M 100+ Free 0 Everyone Tools April 9, 2018 1.283.0037 2.3.3 and up
10808 lesparticuliers.fr LIFESTYLE NaN 96 1.0M 50,000+ Free 0 Everyone Lifestyle November 25, 2014 1.5 2.3 and up
10811 FR Plus 1.6 AUTO_AND_VEHICLES NaN 4 3.9M 100+ Free 0 Everyone Auto & Vehicles July 24, 2018 1.3.6 4.4W and up
10813 DICT.fr Mobile BUSINESS NaN 20 2.7M 10,000+ Free 0 Everyone Business July 17, 2018 2.1.10 4.1 and up
10816 FieldBi FR Offline BUSINESS NaN 2 6.8M 100+ Free 0 Everyone Business August 6, 2018 2.1.8 4.1 and up
10818 Gold Quote - Gold.fr FINANCE NaN 96 1.5M 10,000+ Free 0 Everyone Finance May 19, 2016 2.3 2.2 and up
10821 Poop FR FAMILY NaN 6 2.5M 50+ Free 0 Everyone Entertainment May 29, 2018 1.0 4.0.3 and up
10822 PLMGSS FR PRODUCTIVITY NaN 0 3.1M 10+ Free 0 Everyone Productivity December 1, 2017 1 4.4 and up
10823 List iptv FR VIDEO_PLAYERS NaN 1 2.9M 100+ Free 0 Everyone Video Players & Editors April 22, 2018 1.0 4.0.3 and up
10824 Cardio-FR MEDICAL NaN 67 82M 10,000+ Free 0 Everyone Medical July 31, 2018 2.2.2 4.4 and up
10825 Naruto & Boruto FR SOCIAL NaN 7 7.7M 100+ Free 0 Teen Social February 2, 2018 1.0 4.0 and up
10831 payermonstationnement.fr MAPS_AND_NAVIGATION NaN 38 9.8M 5,000+ Free 0 Everyone Maps & Navigation June 13, 2018 2.0.148.0 4.0 and up
10835 FR Forms BUSINESS NaN 0 9.6M 10+ Free 0 Everyone Business September 29, 2016 1.1.5 4.0 and up
10838 Parkinson Exercices FR MEDICAL NaN 3 9.5M 1,000+ Free 0 Everyone Medical January 20, 2017 1.0 2.2 and up

1474 rows × 13 columns

In [7]:
#Replacing the NaN values with the mean rating value
Application_data["Rating"].fillna(value=Application_data["Rating"].mean(),inplace=True)
Application_data["Rating"]
Out[7]:
0        4.100000
1        3.900000
2        4.700000
3        4.500000
4        4.300000
5        4.400000
6        3.800000
7        4.100000
8        4.400000
9        4.700000
10       4.400000
11       4.400000
12       4.200000
13       4.600000
14       4.400000
15       3.200000
16       4.700000
17       4.500000
18       4.300000
19       4.600000
20       4.000000
21       4.100000
22       4.700000
23       4.193338
24       4.700000
25       4.800000
26       4.700000
27       4.100000
28       3.900000
29       4.100000
           ...   
10811    4.193338
10812    4.100000
10813    4.193338
10814    4.000000
10815    4.200000
10816    4.193338
10817    4.000000
10818    4.193338
10819    3.300000
10820    5.000000
10821    4.193338
10822    4.193338
10823    4.193338
10824    4.193338
10825    4.193338
10826    4.000000
10827    4.200000
10828    3.400000
10829    4.600000
10830    3.800000
10831    4.193338
10832    3.800000
10833    4.800000
10834    4.000000
10835    4.193338
10836    4.500000
10837    5.000000
10838    4.193338
10839    4.500000
10840    4.500000
Name: Rating, Length: 10841, dtype: float64
In [8]:
# Checking the unique values in the Rating column, we find there is an inconsistent value of 19.
Application_data["Rating"].unique()
Out[8]:
array([ 4.1       ,  3.9       ,  4.7       ,  4.5       ,  4.3       ,
        4.4       ,  3.8       ,  4.2       ,  4.6       ,  3.2       ,
        4.        ,  4.19333832,  4.8       ,  4.9       ,  3.6       ,
        3.7       ,  3.3       ,  3.4       ,  3.5       ,  3.1       ,
        5.        ,  2.6       ,  3.        ,  1.9       ,  2.5       ,
        2.8       ,  2.7       ,  1.        ,  2.9       ,  2.3       ,
        2.2       ,  1.7       ,  2.        ,  1.8       ,  2.4       ,
        1.6       ,  2.1       ,  1.4       ,  1.5       ,  1.2       ,
       19.        ])
In [9]:
# Replacing the out-of-range value 19 with a typical rating value (4.1, close to the column mean of ~4.19)
Application_data["Rating"].replace(19.,4.1,inplace=True)

There was no special character found in the Rating column, so step 4 is not required. Moreover, the datatype of the Rating column is already float, so no conversion is needed. Our Rating column is now ready for analysis.
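A quick sanity check can confirm the two invariants we established: no nulls, and every value inside the valid 1-5 range (a sketch on a stand-in Series, not a cell from the original notebook):

```python
import pandas as pd

# Stand-in for Application_data["Rating"] after cleaning
ratings = pd.Series([4.1, 3.9, 4.193338, 4.1])

assert ratings.isnull().sum() == 0       # step 3 held: no nulls remain
assert ratings.between(1.0, 5.0).all()   # step 6 held: all values in range
print("Rating column is clean")
```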

"REVIEWS" column cleaning by following the 6 steps

In [10]:
# Checking the number of unique values in the Reviews column.
len(Application_data["Reviews"].unique())
Out[10]:
6002
In [11]:
# Checking for null values in the Reviews column; none are found.
nullcheck_reviews=pd.isnull(Application_data["Reviews"])
Application_data[nullcheck_reviews]
Out[11]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
In [12]:
# Replacing the special value "3.0M" with its numeric equivalent so the column can be converted.
Application_data["Reviews"].replace("3.0M","3000000",inplace=True)
In [13]:
# Finally, converting the Reviews column from object (string) to a numeric type (int or float)
Application_data["Reviews"]=pd.to_numeric(Application_data["Reviews"])

All the steps have been completed for the reviews column and it is also ready for the analysis.
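As an aside, `pd.to_numeric(..., errors="coerce")` turns any remaining non-numeric entry into NaN instead of raising, which can be a safer default when you are not sure every special value has been caught (a sketch with made-up values):

```python
import pandas as pd

reviews = pd.Series(["159", "967", "3.0M"])
# errors="coerce" maps unparseable strings to NaN rather than raising a ValueError
coerced = pd.to_numeric(reviews, errors="coerce")
print(int(coerced.isnull().sum()))  # the "3.0M" entry becomes NaN
```

Coerced NaNs can then be handled by the same mean-fill used in step 3.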

"SIZE" column cleaning by following the 6 steps

In [14]:
# Checking the unique values of the Size column, we observe values suffixed with "M" and "k", plus the label "Varies with device"
Application_data["Size"].unique()
Out[14]:
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
       '6.2M', '18k', '53M', '1.4M', '3.0M', '5.8M', '3.8M', '9.6M',
       '45M', '63M', '49M', '77M', '4.4M', '4.8M', '70M', '6.9M', '9.3M',
       '10.0M', '8.1M', '36M', '84M', '97M', '2.0M', '1.9M', '1.8M',
       '5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
       '72M', '43M', '7.7M', '6.3M', '334k', '34M', '93M', '65M', '79M',
       '100M', '58M', '50M', '68M', '64M', '67M', '60M', '94M', '232k',
       '99M', '624k', '95M', '8.5k', '41k', '292k', '11k', '80M', '1.7M',
       '74M', '62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M',
       '71M', '86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k',
       '899k', '378k', '266k', '375k', '1.3M', '975k', '980k', '4.1M',
       '89M', '696k', '544k', '525k', '920k', '779k', '853k', '720k',
       '713k', '772k', '318k', '58k', '241k', '196k', '857k', '51k',
       '953k', '865k', '251k', '930k', '540k', '313k', '746k', '203k',
       '26k', '314k', '239k', '371k', '220k', '730k', '756k', '91k',
       '293k', '17k', '74k', '14k', '317k', '78k', '924k', '902k', '818k',
       '81k', '939k', '169k', '45k', '475k', '965k', '90M', '545k', '61k',
       '283k', '655k', '714k', '93k', '872k', '121k', '322k', '1.0M',
       '976k', '172k', '238k', '549k', '206k', '954k', '444k', '717k',
       '210k', '609k', '308k', '705k', '306k', '904k', '473k', '175k',
       '350k', '383k', '454k', '421k', '70k', '812k', '442k', '842k',
       '417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k',
       '429k', '192k', '200k', '460k', '728k', '496k', '816k', '414k',
       '506k', '887k', '613k', '243k', '569k', '778k', '683k', '592k',
       '319k', '186k', '840k', '647k', '191k', '373k', '437k', '598k',
       '716k', '585k', '982k', '222k', '219k', '55k', '948k', '323k',
       '691k', '511k', '951k', '963k', '25k', '554k', '351k', '27k',
       '82k', '208k', '913k', '514k', '551k', '29k', '103k', '898k',
       '743k', '116k', '153k', '209k', '353k', '499k', '173k', '597k',
       '809k', '122k', '411k', '400k', '801k', '787k', '237k', '50k',
       '643k', '986k', '97k', '516k', '837k', '780k', '961k', '269k',
       '20k', '498k', '600k', '749k', '642k', '881k', '72k', '656k',
       '601k', '221k', '228k', '108k', '940k', '176k', '33k', '663k',
       '34k', '942k', '259k', '164k', '458k', '245k', '629k', '28k',
       '288k', '775k', '785k', '636k', '916k', '994k', '309k', '485k',
       '914k', '903k', '608k', '500k', '54k', '562k', '847k', '957k',
       '688k', '811k', '270k', '48k', '329k', '523k', '921k', '874k',
       '981k', '784k', '280k', '24k', '518k', '754k', '892k', '154k',
       '860k', '364k', '387k', '626k', '161k', '879k', '39k', '970k',
       '170k', '141k', '160k', '144k', '143k', '190k', '376k', '193k',
       '246k', '73k', '658k', '992k', '253k', '420k', '404k', '1,000+',
       '470k', '226k', '240k', '89k', '234k', '257k', '861k', '467k',
       '157k', '44k', '676k', '67k', '552k', '885k', '1020k', '582k',
       '619k'], dtype=object)
In [15]:
# Replacing "Varies with device" (and a stray "1,000+" entry from a misaligned row) with NaN, so these can later be replaced with the mean value.
Application_data['Size'].replace('Varies with device', np.nan, inplace = True )
Application_data['Size'].replace('1,000+', np.nan, inplace = True )
In [16]:
# Checking for null values, which we will now find, since the previous step introduced some.
nullcheck_size=pd.isnull(Application_data["Size"])
Application_data[nullcheck_size]
Out[16]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
37 Floor Plan Creator ART_AND_DESIGN 4.100000 36639 NaN 5,000,000+ Free 0 Everyone Art & Design July 14, 2018 Varies with device 2.3.3 and up
42 Textgram - write on photos ART_AND_DESIGN 4.400000 295221 NaN 10,000,000+ Free 0 Everyone Art & Design July 30, 2018 Varies with device Varies with device
52 Used Cars and Trucks for Sale AUTO_AND_VEHICLES 4.600000 17057 NaN 1,000,000+ Free 0 Everyone Auto & Vehicles July 30, 2018 Varies with device Varies with device
67 Ulysse Speedometer AUTO_AND_VEHICLES 4.300000 40211 NaN 5,000,000+ Free 0 Everyone Auto & Vehicles July 30, 2018 Varies with device Varies with device
68 REPUVE AUTO_AND_VEHICLES 3.900000 356 NaN 100,000+ Free 0 Everyone Auto & Vehicles May 25, 2018 Varies with device Varies with device
73 PDD-UA AUTO_AND_VEHICLES 4.800000 736 NaN 100,000+ Free 0 Everyone Auto & Vehicles July 29, 2018 2.9 2.3.3 and up
85 CarMax – Cars for Sale: Search Used Car Inventory AUTO_AND_VEHICLES 4.400000 21777 NaN 1,000,000+ Free 0 Everyone Auto & Vehicles August 4, 2018 Varies with device Varies with device
88 AutoScout24 Switzerland – Find your new car AUTO_AND_VEHICLES 4.600000 13372 NaN 1,000,000+ Free 0 Everyone Auto & Vehicles August 3, 2018 Varies with device Varies with device
89 Zona Azul Digital Fácil SP CET - OFFICIAL São ... AUTO_AND_VEHICLES 4.600000 7880 NaN 100,000+ Free 0 Everyone Auto & Vehicles May 10, 2018 4.6.5 Varies with device
92 Fuelio: Gas log & costs AUTO_AND_VEHICLES 4.600000 65786 NaN 1,000,000+ Free 0 Everyone Auto & Vehicles August 2, 2018 Varies with device 4.0.3 and up
102 Mirror - Zoom & Exposure - BEAUTY 3.900000 32090 NaN 1,000,000+ Free 0 Everyone Beauty October 24, 2016 Varies with device Varies with device
107 Ulta Beauty BEAUTY 4.700000 42050 NaN 1,000,000+ Free 0 Everyone Beauty June 5, 2018 5.4 5.0 and up
109 Selfie Camera BEAUTY 4.200000 17934 NaN 1,000,000+ Free 0 Everyone Beauty September 12, 2017 Varies with device Varies with device
117 Beauty Camera - Selfie Camera BEAUTY 4.000000 113715 NaN 10,000,000+ Free 0 Everyone Beauty August 3, 2017 Varies with device Varies with device
118 Girls Hairstyles BEAUTY 4.100000 3595 NaN 500,000+ Free 0 Everyone Beauty May 24, 2018 Varies with device 4.0 and up
139 Wattpad 📖 Free Books BOOKS_AND_REFERENCE 4.600000 2914724 NaN 100,000,000+ Free 0 Teen Books & Reference August 1, 2018 Varies with device Varies with device
142 Wikipedia BOOKS_AND_REFERENCE 4.400000 577550 NaN 10,000,000+ Free 0 Everyone Books & Reference August 2, 2018 Varies with device Varies with device
143 Amazon Kindle BOOKS_AND_REFERENCE 4.200000 814080 NaN 100,000,000+ Free 0 Teen Books & Reference July 27, 2018 Varies with device Varies with device
144 Cool Reader BOOKS_AND_REFERENCE 4.500000 246315 NaN 10,000,000+ Free 0 Everyone Books & Reference July 17, 2015 Varies with device 1.5 and up
145 Dictionary - Merriam-Webster BOOKS_AND_REFERENCE 4.500000 454060 NaN 10,000,000+ Free 0 Everyone Books & Reference May 18, 2018 Varies with device Varies with device
146 NOOK: Read eBooks & Magazines BOOKS_AND_REFERENCE 4.500000 155446 NaN 10,000,000+ Free 0 Teen Books & Reference April 25, 2018 Varies with device Varies with device
149 FBReader: Favorite Book Reader BOOKS_AND_REFERENCE 4.500000 203130 NaN 10,000,000+ Free 0 Everyone Books & Reference June 28, 2018 Varies with device Varies with device
152 Google Play Books BOOKS_AND_REFERENCE 3.900000 1433233 NaN 1,000,000,000+ Free 0 Teen Books & Reference August 3, 2018 Varies with device Varies with device
157 Spanish English Translator BOOKS_AND_REFERENCE 4.200000 87873 NaN 10,000,000+ Free 0 Teen Books & Reference May 28, 2018 Varies with device Varies with device
162 NOOK App for NOOK Devices BOOKS_AND_REFERENCE 4.700000 19080 NaN 500,000+ Free 0 Everyone Books & Reference April 25, 2018 Varies with device Varies with device
172 Ancestry BOOKS_AND_REFERENCE 4.300000 64513 NaN 5,000,000+ Free 0 Everyone Books & Reference July 31, 2018 Varies with device Varies with device
173 HTC Help BOOKS_AND_REFERENCE 4.200000 8342 NaN 10,000,000+ Free 0 Everyone Books & Reference August 28, 2017 9.00.950462 7.0 and up
179 Moon+ Reader BOOKS_AND_REFERENCE 4.400000 233757 NaN 10,000,000+ Free 0 Everyone Books & Reference May 1, 2018 Varies with device Varies with device
187 Visual Voicemail by MetroPCS BUSINESS 4.100000 16129 NaN 10,000,000+ Free 0 Everyone Business July 30, 2018 Varies with device Varies with device
188 Indeed Job Search BUSINESS 4.300000 674730 NaN 50,000,000+ Free 0 Everyone Business May 21, 2018 Varies with device Varies with device
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10200 Facebook Pages Manager BUSINESS 4.000000 1279800 NaN 50,000,000+ Free 0 Everyone Business August 6, 2018 Varies with device Varies with device
10203 Facebook Ads Manager BUSINESS 4.100000 19051 NaN 1,000,000+ Free 0 Everyone Business August 1, 2018 99.0.0.35.75 4.1 and up
10205 Puffin for Facebook SOCIAL 4.000000 10743 NaN 500,000+ Free 0 Teen Social December 28, 2017 7.0.4.17908 4.1 and up
10218 Messenger Kids – Safer Messaging and Video Chat FAMILY 4.200000 3478 NaN 500,000+ Free 0 Everyone Communication;Creativity August 6, 2018 33.0.0.22.76 4.4 and up
10264 My AEK - Official ΑΕΚ FC app SPORTS 4.800000 3346 NaN 50,000+ Free 0 Everyone Sports January 18, 2018 2.0.2 Varies with device
10281 APOEL FC SPORTS 4.600000 688 NaN 10,000+ Free 0 Everyone Sports January 7, 2015 Varies with device Varies with device
10319 Fire Emblem Heroes FAMILY 4.600000 407694 NaN 5,000,000+ Free 0 Teen Simulation July 19, 2018 2.7.1 4.2 and up
10383 Family Guy The Quest for Stuff GAME 4.000000 995002 NaN 10,000,000+ Free 0 Mature 17+ Adventure July 25, 2018 1.73.0 4.1 and up
10409 studentsLife by FH Kärnten FAMILY 4.400000 108 NaN 5,000+ Free 0 Everyone Education March 30, 2018 4.0.1 4.4 and up
10438 Dolphin and fish coloring book FAMILY 3.900000 2249 NaN 500,000+ Free 0 Everyone Art & Design;Creativity May 15, 2018 Varies with device 4.1 and up
10439 Carpooling FH Hagenberg COMMUNICATION 4.193338 0 NaN 100+ Free 0 Everyone Communication May 18, 2017 Varies with device Varies with device
10447 Talkie - Wi-Fi Calling, Chats, File Sharing COMMUNICATION 4.200000 4838 NaN 500,000+ Free 0 Everyone Communication January 6, 2018 Varies with device Varies with device
10453 Talkie Pro - Wi-Fi Calling, Chats, File Sharing COMMUNICATION 4.500000 201 NaN 1,000+ Paid $2.99 Everyone Communication January 6, 2018 Varies with device Varies with device
10456 Sat-Fi COMMUNICATION 3.600000 97 NaN 5,000+ Free 0 Everyone Communication August 31, 2017 Varies with device Varies with device
10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 4.100000 3000000 NaN Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN
10502 Fun Kid Racing - Motocross FAMILY 4.100000 59768 NaN 10,000,000+ Free 0 Everyone Racing;Action & Adventure August 7, 2018 3.53 4.2 and up
10509 PIP Selfie Camera Photo Editor PHOTOGRAPHY 4.400000 156322 NaN 10,000,000+ Free 0 Everyone Photography February 1, 2018 Varies with device Varies with device
10585 FL House PRODUCTIVITY 4.400000 29 NaN 1,000+ Free 0 Everyone Productivity November 22, 2016 1.6.7 Varies with device
10642 WICShopper SHOPPING 3.900000 3023 NaN 500,000+ Free 0 Everyone Shopping July 26, 2018 Varies with device Varies with device
10645 Football Manager Mobile 2018 SPORTS 3.900000 11460 NaN 100,000+ Paid $8.99 Everyone Sports June 27, 2018 Varies with device 4.1 and up
10647 Motorola FM Radio VIDEO_PLAYERS 3.900000 54815 NaN 100,000,000+ Free 0 Everyone Video Players & Editors May 2, 2018 Varies with device Varies with device
10679 Solitaire+ GAME 4.600000 11235 NaN 100,000+ Paid $2.99 Everyone Card July 30, 2018 Varies with device Varies with device
10681 Future Cloud PRODUCTIVITY 4.600000 1075 NaN 100,000+ Free 0 Everyone Productivity January 22, 2018 Varies with device 4.4 and up
10707 Photo Editor Collage Maker Pro PHOTOGRAPHY 4.500000 1519671 NaN 100,000,000+ Free 0 Everyone Photography February 1, 2018 Varies with device Varies with device
10712 Lalafo Pulsuz Elanlar SHOPPING 4.400000 61392 NaN 1,000,000+ Free 0 Everyone Shopping August 8, 2018 Varies with device Varies with device
10713 My Earthquake Alerts - US & Worldwide Earthquakes WEATHER 4.400000 3471 NaN 100,000+ Free 0 Everyone Weather July 24, 2018 Varies with device Varies with device
10725 Posta App MAPS_AND_NAVIGATION 3.600000 8 NaN 1,000+ Free 0 Everyone Maps & Navigation September 27, 2017 Varies with device 4.4 and up
10765 Chat For Strangers - Video Chat SOCIAL 3.400000 622 NaN 100,000+ Free 0 Mature 17+ Social May 23, 2018 Varies with device Varies with device
10826 Frim: get new friends on local chat rooms SOCIAL 4.000000 88486 NaN 5,000,000+ Free 0 Mature 17+ Social March 23, 2018 Varies with device Varies with device
10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 4.500000 114 NaN 1,000+ Free 0 Mature 17+ Books & Reference January 19, 2015 Varies with device Varies with device

1696 rows × 13 columns

Now we need to replace the NaN values with the mean size of all the applications. But we cannot calculate the mean while the column is of object (string) type, so we need to convert it to a numeric type. Before that, we must strip the "M" and "k" suffixes, since the values cannot be converted to numeric with these symbols present.

In [17]:
Application_data.Size = (Application_data.Size.replace(r'[kM]+$','', regex=True).astype(float) *
                         Application_data.Size.str.extract(r'[\d\.]+([kM]+)', expand=False).fillna(1).replace(['k','M'], [10**3, 10**6]).astype(int))
In [18]:
# Replacing the NaN values with the precomputed mean size (~21,516,530 bytes), inserted as a string for now.
Application_data["Size"].fillna(value="21516530",inplace=True)
In [19]:
# Converting the column to a numeric data type now that the special characters are gone.
Application_data["Size"]=pd.to_numeric(Application_data["Size"])

Here we have completed the cleaning of the Size column. All six steps were required, since this column was particularly messy.
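The regex-based conversion above can be illustrated on a few hand-made values (a standalone sketch, separate from the dataframe; every sample value here carries a suffix, so the `fillna(1)` fallback from the real cell is omitted):

```python
import pandas as pd

sizes = pd.Series(["19M", "201k", "25M"])
# Strip the trailing unit to get the magnitude...
numbers = sizes.replace(r"[kM]+$", "", regex=True).astype(float)
# ...and extract the unit to build a multiplier (k -> 10^3, M -> 10^6)
multipliers = (sizes.str.extract(r"[\d\.]+([kM]+)", expand=False)
                    .replace({"k": 10**3, "M": 10**6}).astype(int))
print((numbers * multipliers).tolist())  # [19000000.0, 201000.0, 25000000.0]
```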

"INSTALL" column cleaning by following the 6 steps

In [20]:
# Checking the unique values of the Installs column, we observe an entry "Free", which is inconsistent and non-numeric, so it should be replaced.
Application_data["Installs"].unique()
Out[20]:
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0', 'Free'], dtype=object)

We need to replace "Free" with the average number of installs. But to calculate the average we first need to remove the "+" and "," from the values, then convert the column to a numeric type, compute the mean, and finally substitute the mean value in place of "Free".
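That sequence can be previewed on a few sample values (a sketch; the literal `15462910` stands in for the mean computed from the full column):

```python
import pandas as pd

installs = pd.Series(["10,000+", "500,000+", "Free"])
cleaned = (installs.str.rstrip("+")                   # drop the trailing "+"
                   .str.replace(",", "", regex=False) # drop thousands separators
                   .replace("Free", "15462910"))      # stand-in for the column mean
print(pd.to_numeric(cleaned).tolist())  # [10000, 500000, 15462910]
```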

In [21]:
# Removing the "+" symbol to make the column numeric.
Application_data["Installs"]=Application_data["Installs"].map(lambda x: x.rstrip('+'))
In [22]:
# Removing the "," from the digits to make the numeric conversion easier.
Application_data["Installs"]=Application_data["Installs"].str.replace(",","")
In [23]:
# No null entries were found in this column
nullcheck_installs=pd.isnull(Application_data["Installs"])
Application_data[nullcheck_installs]
Out[23]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
In [24]:
# Replacing the inconsistent "Free" entry with the precomputed mean of the column.
Application_data["Installs"].replace("Free","15462910",inplace=True)
In [25]:
# Converting the Datatype to the numeric type for analysis
Application_data["Installs"]=pd.to_numeric(Application_data["Installs"])

In this way, we have made our Installs column ready for the analysis by following all the 6 steps again.

"TYPE" column cleaning by following the 6 steps

In [26]:
# Checking the unique values, we find NaN and "0", which should both be replaced with "Free".
Application_data["Type"].unique()
Out[26]:
array(['Free', 'Paid', nan, '0'], dtype=object)
In [27]:
# Replacing 0 with Free
Application_data["Type"].replace("0","Free",inplace=True)
In [28]:
# Filling the missing values with Free, since most of the applications are free on Google play.
Application_data["Type"].fillna(value="Free",inplace=True)
In [29]:
# Adding dummy (one-hot) columns for Type, so that it can contribute to our model.
dummy_type=pd.get_dummies(Application_data["Type"])
In [30]:
#Concatenating the dummy columns with the main dataframe.
Application_data=pd.concat([Application_data,dummy_type],axis=1)
In [31]:
# Finally dropping the type column.
Application_data.drop(["Type"],axis=1,inplace=True)
In [32]:
Application_data.head()
Out[32]:
App Category Rating Reviews Size Installs Price Content Rating Genres Last Updated Current Ver Android Ver Free Paid
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19000000.0 10000 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up 1 0
1 Coloring book moana ART_AND_DESIGN 3.9 967 14000000.0 500000 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up 1 0
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8700000.0 5000000 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up 1 0
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25000000.0 50000000 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up 1 0
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2800000.0 100000 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up 1 0

In this way we have replaced the categorical Type column with dummy columns, making our feature space numeric and ready for modelling.
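The dummy-column step just applied can be illustrated on a minimal frame (assumed toy data; `astype(int)` keeps the 0/1 values shown in the table above):

```python
import pandas as pd

# Toy "Type" column with the two valid values.
df = pd.DataFrame({"Type": ["Free", "Paid", "Free"]})

# One indicator column per category, then drop the original column.
dummies = pd.get_dummies(df["Type"]).astype(int)
df = pd.concat([df, dummies], axis=1).drop(columns=["Type"])
print(df["Free"].tolist(), df["Paid"].tolist())  # [1, 0, 1] [0, 1, 0]
```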

"PRICE" column cleaning by following the 6 steps

In [33]:
# By checking the unique values we observe that "Everyone" is an inconsistent value that should be removed.
Application_data["Price"].unique()
Out[33]:
array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
       '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
       '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
       '$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
       '$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
       '$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
       '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
       '$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
       '$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
       '$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
       '$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
       '$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
       '$394.99', '$1.26', 'Everyone', '$1.20', '$1.04'], dtype=object)

To compute the mean of this column its datatype must be numeric. For that, we need to strip the dollar symbol from the values and drop the "Everyone" row, since it is an inconsistent entry that would compromise the performance of our model.

In [34]:
# Removing the dollar symbol
Application_data["Price"]=Application_data["Price"].map(lambda x: x.lstrip('$'))
In [35]:
# Removing the non essential row value.
Application_data.drop(Application_data[Application_data["Price"] == "Everyone"].index, inplace=True)
In [36]:
# Checking confirms there are no null values
nullcheck_Prices=pd.isnull(Application_data["Price"])
Application_data[nullcheck_Prices]
Out[36]:
App Category Rating Reviews Size Installs Price Content Rating Genres Last Updated Current Ver Android Ver Free Paid
In [37]:
# Finally converting to numeric type for analysis
Application_data["Price"]=pd.to_numeric(Application_data["Price"])

We have cleaned the Price column by following all the 6 steps as per the requirement, now this column is ready for the analysis.
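The Price cleaning can be condensed into a few lines on toy values (a sketch, including the bad "Everyone" entry):

```python
import pandas as pd

# Toy prices in the original format, including the inconsistent row.
prices = pd.Series(["0", "$4.99", "Everyone", "$1.49"])

# Drop the bad row, strip the dollar sign, convert to numeric.
prices = prices[prices != "Everyone"]
prices = pd.to_numeric(prices.str.lstrip("$"))
print(prices.tolist())  # [0.0, 4.99, 1.49]
```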

"CATEGORY" column cleaning by following the 6 steps

In [38]:
# Checking the unique values for inconsistencies
Application_data["Category"].unique()
Out[38]:
array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'],
      dtype=object)
In [39]:
# Replacing the inconsistent "1.9" category (if still present) with MISCELLANEOUS
Application_data["Category"].replace("1.9","MISCELLANEOUS",inplace=True)
In [40]:
# Checking for null values, there were no null values found for this column
nullcheck=pd.isnull(Application_data["Category"])
Application_data[nullcheck]
Out[40]:
App Category Rating Reviews Size Installs Price Content Rating Genres Last Updated Current Ver Android Ver Free Paid

For this column we use label encoding rather than dummies: creating dummies would add far too many extra columns to the feature matrix. Label encoding instead assigns a numerical value to each category of application.
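Label encoding itself is a small operation; on a toy list of categories it behaves like this (scikit-learn assigns codes in alphabetical order of the classes):

```python
from sklearn.preprocessing import LabelEncoder

# Each category gets one integer code; classes_ records the mapping.
enc = LabelEncoder()
codes = enc.fit_transform(["GAME", "ART_AND_DESIGN", "GAME", "BEAUTY"])
print(list(codes))         # [2, 0, 2, 1]
print(list(enc.classes_))  # ['ART_AND_DESIGN', 'BEAUTY', 'GAME']
```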

In [41]:
# Importing the required library
from sklearn.preprocessing import LabelEncoder
In [42]:
# Instantiating the encoder
labelencoder2 = LabelEncoder()
In [44]:
# Encoding the Category column using scikit-learn
Application_data['Categories_encoded'] = labelencoder2.fit_transform(Application_data['Category'])
In [45]:
# Finally dropping the Category column, since it has already been encoded.
Application_data.drop(["Category"],axis=1,inplace=True)
In [46]:
Application_data.head()
Out[46]:
App Rating Reviews Size Installs Price Content Rating Genres Last Updated Current Ver Android Ver Free Paid Categories_encoded
0 Photo Editor & Candy Camera & Grid & ScrapBook 4.1 159 19000000.0 10000 0.0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up 1 0 0
1 Coloring book moana 3.9 967 14000000.0 500000 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up 1 0 0
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 4.7 87510 8700000.0 5000000 0.0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up 1 0 0
3 Sketch - Draw & Paint 4.5 215644 25000000.0 50000000 0.0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up 1 0 0
4 Pixel Draw - Number Art Coloring Book 4.3 967 2800000.0 100000 0.0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up 1 0 0

"CONTENT RATING" Column cleaning by 6 Steps

For this categorical column we again apply label encoding, just as we did for the Category column.

In [47]:
# Checking for unique values
Application_data["Content Rating"].unique()
Out[47]:
array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)
In [48]:
# Null check for Content Rating
nullcheck_contentrating=pd.isnull(Application_data["Content Rating"])
Application_data[nullcheck_contentrating]
Out[48]:
App Rating Reviews Size Installs Price Content Rating Genres Last Updated Current Ver Android Ver Free Paid Categories_encoded
In [50]:
# importing the required package
from sklearn.preprocessing import LabelEncoder
In [51]:
#instantiating the encoder
labelencoder = LabelEncoder()
In [52]:
# encoding the column
Application_data['Content_Rating_encoded'] = labelencoder.fit_transform(Application_data['Content Rating'])
In [53]:
# Finally removing the Content Rating column after encoding
Application_data.drop(["Content Rating"],axis=1,inplace=True)
In [54]:
Application_data.head()
Out[54]:
App Rating Reviews Size Installs Price Genres Last Updated Current Ver Android Ver Free Paid Categories_encoded Content_Rating_encoded
0 Photo Editor & Candy Camera & Grid & ScrapBook 4.1 159 19000000.0 10000 0.0 Art & Design January 7, 2018 1.0.0 4.0.3 and up 1 0 0 1
1 Coloring book moana 3.9 967 14000000.0 500000 0.0 Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up 1 0 0 1
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 4.7 87510 8700000.0 5000000 0.0 Art & Design August 1, 2018 1.2.4 4.0.3 and up 1 0 0 1
3 Sketch - Draw & Paint 4.5 215644 25000000.0 50000000 0.0 Art & Design June 8, 2018 Varies with device 4.2 and up 1 0 0 4
4 Pixel Draw - Number Art Coloring Book 4.3 967 2800000.0 100000 0.0 Art & Design;Creativity June 20, 2018 1.1 4.4 and up 1 0 0 1
In [55]:
# Checking the datatypes of the columns to ensure that we have successfully gathered all the numerical columns.
Application_data.dtypes
Out[55]:
App                        object
Rating                    float64
Reviews                     int64
Size                      float64
Installs                    int64
Price                     float64
Genres                     object
Last Updated               object
Current Ver                object
Android Ver                object
Free                        uint8
Paid                        uint8
Categories_encoded          int32
Content_Rating_encoded      int32
dtype: object
In [56]:
# Finding the mean of all the numerical columns
Application_data.mean()
Out[56]:
Rating                    4.191972e+00
Reviews                   4.441529e+05
Size                      2.151653e+07
Installs                  1.546434e+07
Price                     1.027368e+00
Free                      9.261993e-01
Paid                      7.380074e-02
Categories_encoded        1.672537e+01
Content_Rating_encoded    1.465037e+00
dtype: float64

EXPLORATORY DATA ANALYSIS

Below is a complete analysis of the relationships between the features of our data. This is required so that we can understand which features will play a significant role in predicting the number of installs for an application.

In [57]:
sns.pairplot(Application_data)
Out[57]:
<seaborn.axisgrid.PairGrid at 0x1dbd630a208>

Here a pairplot is shown between all the numerical columns of the data, giving a high-level intuition of the relationships between the various features. First, histograms will be drawn for all the numerical columns to show their counts and distributions. Plotly is used for the graphical representations.

In [58]:
colorassigned=Application_data["Rating"]
fig = px.histogram(Application_data, x="Rating", marginal="rug",
                   hover_data=Application_data.columns,nbins=30,color=colorassigned)
fig.show()

The above graph is a histogram showing the distribution of ratings across android applications, colored by rating value with the color scale on the right. Hovering on the graph shows that rating 4.1 has the maximum count (1474); the counts increase steadily from 3.4 (128) up to the maximum at 4.1 (1474), then fluctuate. This means most applications on Google Play are rated between 4 and 4.5.

In [59]:
fig = px.histogram(Application_data, x="Reviews", marginal="rug",
                   hover_data=Application_data.columns,nbins=30)
fig.show()

This is a histogram showing the distribution of the number of reviews per application. It is clearly visible that about 90% of the applications on the Google Play store have fewer than 5 million reviews; 138 applications have between 5 and 10 million reviews, and only 47 have between 10 and 15 million. So the majority of applications have fewer than 5 million reviews.

In [60]:
colorassigned=Application_data["Size"]
fig = px.histogram(Application_data, x="Size", marginal="rug",
                   hover_data=Application_data.columns,nbins=30,color=colorassigned)
fig.show()

The above graph is a histogram showing the distribution of application sizes. Most applications are small: as size increases along the x-axis, the bars get shorter, meaning the count of such apps decreases. So Google Play has more small applications than large ones; most applications are around 21.5 MB in size.

In [61]:
colorassigned=Application_data["Installs"]
fig = px.histogram(Application_data, x="Installs", marginal="rug",
                   hover_data=Application_data.columns,nbins=30,color=colorassigned)
fig.show()

The above graph shows the number of installs per application. The majority of applications have fewer than 10 million installs; only 58 applications have more than 1 billion installs on Google Play.

In [62]:
colorassigned=Application_data["Price"]
fig = px.histogram(Application_data, x="Price", marginal="rug",
                   hover_data=Application_data.columns,nbins=30,color=colorassigned)
fig.show()

This histogram shows the price distribution of applications on Google Play. The majority of applications are free. The most expensive applications cost $400; there are 12 of them.

With this we have completed the individual analysis of all the numerical columns of our dataset. Next we analyse the relationships between the columns. The steps followed below are:

1)- Calculate the correlation values and draw a heatmap to show the correlations between the columns.

2)- Once we know the correlations, we know which columns affect one another, and we plot columns in pairs based on their correlation values. If the correlation is negative or very small, there is no point in plotting those columns.

3)- After plotting, we fit a linear regression line to the data points. The higher the correlation value, the better the line of fit.
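The correlation value in step 1 is the Pearson coefficient, which can be computed by hand to see what the heatmap encodes (a NumPy-only sketch):

```python
import numpy as np

def pearson(x, y):
    # Pearson r = covariance(x, y) / (std(x) * std(y))
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum())

# Perfectly linearly related columns give r = 1.
print(round(pearson([1, 2, 3, 4], [10, 20, 30, 40]), 6))  # 1.0
```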

In [63]:
# Calculating the Correlation and plotting the heatmap to know the relations.
cors=Application_data.corr()
fig = px.imshow(cors,labels=dict(color="Pearson Correlation"), x=['Rating', 'Reviews', 'Size', 'Installs', 'Price','Paid','Free','Content_Rating_encoded','Categories_encoded'],
                y=['Rating', 'Reviews', 'Size','Installs','Price','Paid','Free','Content_Rating_encoded','Categories_encoded'])
fig.show()

Following inferences can be drawn from this heatmap:

Correlation value | Features involved    | Verdict
------------------|----------------------|----------------------
-0.020            | Price vs Rating      | No correlation
-0.009            | Price vs Reviews     | No correlation
-0.022            | Price vs Size        | No correlation
-0.011            | Price vs Installs    | No correlation
 0.051            | Installs vs Rating   | No correlation
 0.643            | Installs vs Reviews  | Strong correlation
 0.082            | Installs vs Size     | No correlation
-0.011            | Installs vs Price    | No correlation
 0.074            | Size vs Rating       | No correlation
 0.128            | Size vs Reviews      | Very weak correlation
 0.082            | Size vs Installs     | No correlation
-0.022            | Size vs Price        | No correlation
 0.067            | Reviews vs Rating    | No correlation

We will plot only those pairs whose correlation value is greater than 0.1; the rest show no correlation, so plotting them would not be fruitful.

In [64]:
# Plotting scatter plot with a line of fit between Installs and Reviews, these two have the highest Correlation between them.
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Application_data["Installs"],Application_data["Reviews"])
colorassigned=Application_data["Reviews"]
fig = px.scatter(Application_data, x="Installs", y="Reviews",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.643
P-value: 0.00000000

We get a good fit to the data points, since the correlation between these two columns is significant. As the number of installs increases, the number of reviews also increases, which makes sense: a user can only give feedback on an application after installing it. Given a new data point, we can predict its number of installs from its number of reviews. Hovering on the red line shows the equation of the fitted straight line, and hovering on each data point shows its installs and reviews.
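The red trendline plotly draws is an ordinary least squares fit; the same slope and intercept can be recovered with NumPy (a sketch on made-up installs/reviews pairs, not the notebook's columns):

```python
import numpy as np

# Toy pairs lying exactly on reviews = 2 * installs.
installs = np.array([100.0, 200.0, 300.0, 400.0])
reviews = 2.0 * installs

# Fit a degree-1 polynomial: reviews ≈ slope * installs + intercept.
slope, intercept = np.polyfit(installs, reviews, deg=1)
print(round(slope, 6))  # 2.0
```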

In [65]:
# Plotting scatter plot with a line of fit between Rating and Reviews, these two have very less correlation between them. 
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Application_data["Rating"],Application_data["Reviews"])
colorassigned=Application_data["Reviews"]
fig = px.scatter(Application_data, x="Rating", y="Reviews",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.068
P-value: 0.00000000

As can be observed from this graph, applications rated between 4 and 4.7 have the maximum number of reviews. However, we cannot say that reviews increase with rating in general; that holds only within the 4 to 4.7 range, while below 4 and above 4.7 the trend differs. Above a rating of 4.7 the number of reviews drops sharply: the apps with a 5-star rating have only 4 reviews, whereas apps rated below 4 have been reviewed by many users.

In [66]:
# Plotting scatter plot with a line of fit between Size and Reviews, these two have very less correlation between them. 
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Application_data["Size"],Application_data["Reviews"])
colorassigned=Application_data["Reviews"]
fig = px.scatter(Application_data, x="Size", y="Reviews",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.128
P-value: 0.00000000

No general trend is observed in this graph, since the correlation between these two columns is very weak. There are 21 MB applications with 80 million reviews, while larger 98 MB applications get 45 million reviews.

In [67]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Application_data["Installs"],Application_data["Categories_encoded"])
colorassigned=Application_data["Categories_encoded"]
fig = px.scatter(Application_data, x="Installs", y="Categories_encoded",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.023
P-value: 0.01898093

MODEL BUILDING AND EVALUATION USING SKLEARN

The final step is creating the models that will predict the number of installs for an android application on Google Play. We will use 3 regressors for this purpose: Linear, DecisionTree and RandomForest. Finally, the performance of all 3 will be compared graphically.

LINEAR REGRESSOR

In [68]:
# Splitting the target variable and the feature matrix
X=Application_data[["Reviews","Size","Rating","Price","Paid","Free","Categories_encoded","Content_Rating_encoded"]]
y=Application_data["Installs"]
In [69]:
# importing train test set
from sklearn.model_selection import train_test_split
In [70]:
# splitting the training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [71]:
# importing linear regressor
from sklearn.linear_model import LinearRegression
In [72]:
# Instantiating linear regressor
lm=LinearRegression()
In [73]:
# Fitting the model
lm.fit(X_train,y_train)
Out[73]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [74]:
# making predictions on the test set
predictions=lm.predict(X_test)
In [75]:
# displaying predictions
predictions
Out[75]:
array([10008655.20346311, 18375503.19627455,  4185440.25093459, ...,
       10957590.51636698,  4178834.52202947, 10464112.03923625])
In [76]:
# R^2 score for the Linear regressor on the test set
linearregressionscore=lm.score(X_test,y_test)
linearregressionscore
Out[76]:
0.49541896184862166
In [77]:
# The coefficient for Linear regressor per feature.
lm.coef_
Out[77]:
array([ 1.97371877e+01,  2.75901845e-04,  1.17336412e+06,  3.38586456e+03,
       -3.87298763e+06,  3.87298763e+06,  2.29078396e+05,  9.46738964e+05])

Evaluating the metrics for the linear regressor: the mean absolute error, the mean squared error and finally the root mean squared error.
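These metrics can be checked by hand on toy predictions (a sketch; note that the RMSE is the square root of the MSE, not of the MAE):

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mae = np.mean(np.abs(y_true - y_pred))   # (10 + 10 + 30) / 3
mse = np.mean((y_true - y_pred) ** 2)    # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                      # square root of the MSE
print(round(mae, 3), round(mse, 3), round(rmse, 3))
```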

In [78]:
# Importing the metrics
from sklearn import metrics
In [79]:
# Mean absolute error on test data
metrics.mean_absolute_error(y_test,predictions)
Out[79]:
14428100.343854
In [80]:
# Mean squared error on test data
metrics.mean_squared_error(y_test,predictions)
Out[80]:
3338605858668020.0
In [81]:
# Square root of the mean absolute error on test data (note: a true RMSE would take the square root of the mean squared error)
rmelinear=np.sqrt(metrics.mean_absolute_error(y_test,predictions))
rmelinear
Out[81]:
3798.433933064257

DECISION TREE REGRESSOR

In [82]:
# Defining the feature matrix and the target variable
X=Application_data[["Reviews","Size","Rating","Price","Paid","Free","Categories_encoded","Content_Rating_encoded"]]
y=Application_data["Installs"]
In [83]:
# Importing the train test split
from sklearn.model_selection import train_test_split
In [84]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [85]:
# Importing the regressor
from sklearn.tree import DecisionTreeRegressor
In [86]:
# Instantiating the regressor
decisiontreereg=DecisionTreeRegressor()
In [87]:
# Fitting the model
decisiontreereg.fit(X_train,y_train)
Out[87]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [88]:
# Getting the predicted values
y_prediction=decisiontreereg.predict(X_test)
In [89]:
# R^2 score for the decision tree regressor on the test set
decisiontreescore=decisiontreereg.score(X_test,y_test)
decisiontreescore
Out[89]:
0.8981496501463132
In [90]:
from sklearn import metrics
In [91]:
# Mean absolute error
metrics.mean_absolute_error(y_test,y_prediction)
Out[91]:
3082997.210373104
In [92]:
# Square root of the mean absolute error (note: a true RMSE would use mean_squared_error)
rmetree=np.sqrt(metrics.mean_absolute_error(y_test,y_prediction))
rmetree
Out[92]:
1755.846579394995

RANDOM FOREST REGRESSOR

In [122]:
# Separating the feature matrix and target variable
X=Application_data[["Reviews","Size","Rating","Price","Paid","Free","Categories_encoded","Content_Rating_encoded"]]
y=Application_data["Installs"]
In [123]:
# Importing the train test split
from sklearn.model_selection import train_test_split
In [124]:
# Splitting the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [125]:
# Importing the random forest regressor
from sklearn.ensemble import RandomForestRegressor
In [126]:
# Instantiating the regressor, specifying the number of trees in the forest.
Randomforestreg=RandomForestRegressor(n_estimators = 100,n_jobs = -1,oob_score = True, bootstrap = True,random_state=42)
In [127]:
# fitting the model
Randomforestreg.fit(X_train,y_train)
Out[127]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                      oob_score=True, random_state=42, verbose=0,
                      warm_start=False)
In [128]:
# Predicting the number of installs
y_prediction_randomforest=Randomforestreg.predict(X_test)

It is very useful to know how significant each feature is for predicting the target variable. Below is a plot showing the importance of the various features in predicting the number of installs.
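Feature importances come straight from the fitted forest's `feature_importances_` attribute; on synthetic data where only one feature matters, that feature dominates (a sketch with made-up columns):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only "signal" drives the target, "noise" is irrelevant.
rng = np.random.RandomState(0)
X = pd.DataFrame({"signal": rng.rand(200), "noise": rng.rand(200)})
y = 10 * X["signal"]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
imp = pd.Series(rf.feature_importances_, index=X.columns)
print(imp.idxmax())  # signal
```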

In [129]:
# Using barplot from seaborn to show importance of features in sorted manner.
feature_imp=pd.DataFrame(sorted(zip(Randomforestreg.feature_importances_,X.columns)),columns=["Significance","Features"])
fig=plt.figure(figsize=(6,6))
sns.barplot(x="Significance",y="Features",data=feature_imp.sort_values(by="Significance",ascending=False),dodge=False)
plt.title("Important features for predicting the number of installs")
plt.tight_layout()
plt.show()
In [130]:
from sklearn.metrics import r2_score,mean_squared_error
In [131]:
# The performance of random forest.
print('R^2 Training Score: {:.2f} \nOOB Score: {:.2f} \nR^2 Validation Score: {:.2f}'.format(Randomforestreg.score(X_train, y_train), 
                                                                                             Randomforestreg.oob_score_,
                                                                                             Randomforestreg.score(X_test, y_test)))
R^2 Training Score: 0.98 
OOB Score: 0.87 
R^2 Validation Score: 0.88
In [132]:
# R^2 score for the random forest on the test set
randomforestscore=Randomforestreg.score(X_test,y_test)
randomforestscore
Out[132]:
0.8823528634024886
In [133]:
# Importing the performance metrics
from sklearn import metrics
In [134]:
# Mean absolute error
metrics.mean_absolute_error(y_test,y_prediction_randomforest)
Out[134]:
5018571.900065896
In [135]:
# Square root of the mean absolute error (note: a true RMSE would use mean_squared_error)
rmerandom=np.sqrt(metrics.mean_absolute_error(y_test,y_prediction_randomforest))
rmerandom
Out[135]:
2240.2169314746943

The final step is to compare the models' accuracy for this task. We will compare the accuracy score and the root mean square error of all 3 models. To achieve that, we create a dataframe holding the accuracy score and root mean square error for each model, and then plot it.
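The comparison dataframe can be sketched with placeholder numbers (the notebook uses the fitted models' actual scores and errors):

```python
import pandas as pd

# Hypothetical scores/errors standing in for the fitted models' values.
results = {"Linear Regressor": [0.50, 3800.0],
           "DecisionTree Regressor": [0.90, 1750.0],
           "RandomForest Regressor": [0.88, 2240.0]}
comparison = pd.DataFrame(results, index=["Score", "Root Mean Square Error"])
print(comparison.loc["Score"].idxmax())  # DecisionTree Regressor
```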

In [136]:
# Creating the dataframe that has the accuracy score and root mean squared error for all 3 models.
model_results={"Linear Regressor":[linearregressionscore,rmelinear],"DecisionTree Regressor":[decisiontreescore,rmetree],"RandomForest Regressor":[randomforestscore,rmerandom]}
df_comparison_models=pd.DataFrame(model_results,["Score","Root Mean Square Error"])
In [137]:
df_comparison_models.head()
Out[137]:
Linear Regressor DecisionTree Regressor RandomForest Regressor
Score 0.495419 0.898150 0.882353
Root Mean Square Error 3798.433933 1755.846579 2240.216931
In [138]:
# Plotting the accuracy of all the 3 models
%matplotlib inline
model_accuracy = pd.Series(data=[linearregressionscore,decisiontreescore,randomforestscore], 
        index=['Linear_Regressor','DecisionTree Regressor','RandomForest Regressor'])
fig= plt.figure(figsize=(8,8))
model_accuracy.sort_values().plot.barh()
plt.title('Model Accuracy')
Out[138]:
Text(0.5, 1.0, 'Model Accuracy')
In [139]:
# Plotting the Root Mean Squared Error comparison
%matplotlib inline
model_accuracy = pd.Series(data=[rmelinear,rmetree,rmerandom], 
        index=['Linear_Regressor','DecisionTree Regressor','RandomForest Regressor'])
fig= plt.figure(figsize=(8,8))
model_accuracy.sort_values().plot.barh()
plt.title('Model Root Mean Squared Error')
Out[139]:
Text(0.5, 1.0, 'Model Root Mean Squared Error')

FINAL THOUGHTS

That completes the full workflow for this dataset. We loaded the data, cleaned the features, did a thorough exploratory data analysis to understand which features would be vital for predicting our target variable, and finally built the models and made predictions. At the end we compared the accuracy of all the models used in the analysis.